2 Multiple Linear Regression

1 Review

Recall that for the linear model
$$y_t=\beta_0+\beta_1 t+\varepsilon_t,\qquad t=1,\dots,n,\quad \varepsilon_t\overset{\text{i.i.d.}}{\sim}N(0,\sigma^2), \tag{1}$$
with priors $\beta_0,\beta_1,\log\sigma\sim\mathrm{Unif}(-C,C)$, the posterior is
$$f_{\beta_0,\beta_1\mid\text{data}}(\beta_0,\beta_1)\propto\left(\frac{S(\hat\beta_0,\hat\beta_1)}{S(\beta_0,\beta_1)}\right)^{n/2},$$
where $S(\beta_0,\beta_1)=\sum_{t=1}^n\left(y_t-\beta_0-\beta_1 t\right)^2$.

For the cubic model
$$y_t=\beta_0+\beta_1 t+\beta_2 t^2+\beta_3 t^3+\varepsilon_t, \tag{2}$$
we similarly have
$$f_{\beta_0,\beta_1,\beta_2,\beta_3\mid\text{data}}(\beta_0,\beta_1,\beta_2,\beta_3)\propto\left[\frac{S(\hat\beta_0,\hat\beta_1,\hat\beta_2,\hat\beta_3)}{S(\beta_0,\beta_1,\beta_2,\beta_3)}\right]^{n/2},$$
where $S(\beta_0,\beta_1,\beta_2,\beta_3)=\sum_{t=1}^n\left(y_t-\beta_0-\beta_1 t-\beta_2 t^2-\beta_3 t^3\right)^2$.

2 Vector and Matrix Notation

Denote
$$y_{n\times 1}=\begin{pmatrix}y_1\\\vdots\\y_n\end{pmatrix},\qquad
\beta=\begin{pmatrix}\beta_0\\\beta_1\end{pmatrix},\quad
X=\begin{bmatrix}1&1\\1&2\\\vdots&\vdots\\1&n\end{bmatrix}\ \text{for (1)},\qquad\text{or}\quad
\beta=\begin{pmatrix}\beta_0\\\vdots\\\beta_3\end{pmatrix},\quad
X=\begin{bmatrix}1&1&1&1\\\vdots&\vdots&\vdots&\vdots\\1&n&n^2&n^3\end{bmatrix}\ \text{for (2)}.$$
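As a concrete illustration, the design matrices above can be built with NumPy (a sketch; the variable names are illustrative):

```python
import numpy as np

n = 10
t = np.arange(1, n + 1)  # t = 1, ..., n

# Design matrix for model (1): columns [1, t]
X1 = np.column_stack([np.ones(n), t])

# Design matrix for model (2): columns [1, t, t^2, t^3]
X2 = np.column_stack([t**0, t**1, t**2, t**3])

print(X1.shape, X2.shape)  # (10, 2) (10, 4)
print(X2[2])               # row for t = 3: powers 1, 3, 9, 27
```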

2.1 Minimizer of Squared Loss

We can write $S(\beta)=\|y-X\beta\|^2$. Since
$$S(\beta)=\|y-X\beta\|^2=(y-X\beta)^T(y-X\beta)=y^Ty-y^TX\beta-\beta^TX^Ty+\beta^TX^TX\beta,$$
the gradient is
$$\nabla S(\beta)=-X^Ty-X^Ty+2X^TX\beta=2X^T(X\beta-y).$$
Solving $\nabla S(\beta)=0$, we obtain $\hat\beta=(X^TX)^{-1}X^Ty$, which minimizes $S(\beta)$.
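The closed form can be checked numerically on simulated data (a sketch; `np.linalg.lstsq` is used only as an independent cross-check):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
t = np.arange(1, n + 1)
X = np.column_stack([np.ones(n), t])        # design matrix of model (1)
y = 2.0 + 0.5 * t + rng.normal(0, 1.0, n)   # simulated data with beta = (2, 0.5)

# beta_hat = (X^T X)^{-1} X^T y; solving the normal equations directly
# is numerically preferable to forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against the library least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))    # True
```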

2.2 Pythagorean Identity

$$S(\beta)=\|y-X\beta\|^2=\|y-X\hat\beta+X\hat\beta-X\beta\|^2=\|y-X\hat\beta\|^2+\|X\hat\beta-X\beta\|^2=S(\hat\beta)+(\hat\beta-\beta)^TX^TX(\hat\beta-\beta).$$

The cross term $2(\hat\beta-\beta)^TX^T(y-X\hat\beta)$ vanishes because $\hat\beta$ satisfies the normal equations $X^T(y-X\hat\beta)=0$.
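The decomposition can be verified numerically at an arbitrary $\beta$ (an illustrative sketch on simulated data):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
t = np.arange(1, n + 1)
X = np.column_stack([np.ones(n), t])
y = 1.0 + 0.3 * t + rng.normal(0, 0.5, n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

def S(beta):
    """Squared loss S(beta) = ||y - X beta||^2."""
    r = y - X @ beta
    return r @ r

beta = np.array([0.5, 0.1])   # an arbitrary beta
d = beta_hat - beta

# Pythagorean identity: S(beta) = S(beta_hat) + d^T X^T X d
print(np.isclose(S(beta), S(beta_hat) + d @ (X.T @ X) @ d))  # True

# The cross term is zero because X^T (y - X beta_hat) = 0 (normal equations)
print(np.isclose(d @ (X.T @ (y - X @ beta_hat)), 0.0))       # True
```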


Now we go back to the posterior:
$$f_{\beta\mid\text{data}}(\beta)\propto\left(\frac{S(\hat\beta)}{S(\beta)}\right)^{n/2}=\left[\frac{S(\hat\beta)}{S(\hat\beta)+(\hat\beta-\beta)^TX^TX(\hat\beta-\beta)}\right]^{n/2}=\left[1+(\hat\beta-\beta)^T\frac{X^TX}{S(\hat\beta)}(\hat\beta-\beta)\right]^{-n/2}.$$

3 Multivariate Normal & t-Distribution

Let $X=(X_1,\dots,X_p)^T\sim N_p(\mu,\Sigma_{p\times p})$, with joint density
$$f(x_1,\dots,x_p)=\left(\frac{1}{\sqrt{2\pi}}\right)^{p}\frac{1}{\sqrt{\det\Sigma}}\exp\left[-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right].$$

Suppose $X\sim N_p(\mu,\Sigma)$ and $V\sim\chi^2_k$, independent of $X$. Then
$$T=\mu+\frac{X-\mu}{\sqrt{V/k}}=\begin{pmatrix}\mu_1+\dfrac{X_1-\mu_1}{\sqrt{V/k}}\\\vdots\\\mu_p+\dfrac{X_p-\mu_p}{\sqrt{V/k}}\end{pmatrix}\sim t_{p,k}(\mu,\Sigma)$$
defines the multivariate $t$-distribution.
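This representation gives an immediate way to sample from $t_{p,k}(\mu,\Sigma)$ (a sketch; `rmvt` is a hypothetical helper name, not a library function):

```python
import numpy as np

rng = np.random.default_rng(2)

def rmvt(mu, Sigma, k, size):
    """Draw `size` samples from t_{p,k}(mu, Sigma) via the N_p / chi^2_k construction."""
    p = len(mu)
    Z = rng.multivariate_normal(np.zeros(p), Sigma, size=size)  # Z = X - mu ~ N_p(0, Sigma)
    V = rng.chisquare(k, size=size)                             # V ~ chi^2_k, independent of Z
    return mu + Z / np.sqrt(V / k)[:, None]                     # T = mu + (X - mu)/sqrt(V/k)

mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
draws = rmvt(mu, Sigma, k=30, size=100_000)
print(draws.mean(axis=0))   # close to mu
```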

Proposition

The density of $T$ is
$$f_T(t_1,\dots,t_p)\propto\left[\frac{1}{1+\frac{1}{k}(t-\mu)^T\Sigma^{-1}(t-\mu)}\right]^{\frac{p+k}{2}}.$$

Note that when $k$ is large, $t_{p,k}(\mu,\Sigma)$ is close to $N_p(\mu,\Sigma)$.

Fact

If $T\sim t_{p,k}(\mu,\Sigma)$ has components $T_1,\dots,T_p$, then for each $j=1,\dots,p$, $T_j\sim t_{1,k}(\mu_j,\Sigma_{jj})$, where $\mu_j$ is the $j$th component of $\mu$ and $\Sigma_{jj}$ is the $(j,j)$th entry of $\Sigma$.
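The Fact can be checked empirically: each standardized component $(T_j-\mu_j)/\sqrt{\Sigma_{jj}}$ should behave like a univariate $t_k$, whose variance is $k/(k-2)$ (a simulation sketch with illustrative parameter values):

```python
import numpy as np

rng = np.random.default_rng(4)
p, k, N = 3, 10, 200_000
mu = np.array([0.0, 2.0, -1.0])
Sigma = np.diag([1.0, 4.0, 0.25])

# Build T ~ t_{p,k}(mu, Sigma) from its defining representation
Z = rng.multivariate_normal(np.zeros(p), Sigma, size=N)
V = rng.chisquare(k, size=N)
T = mu + Z / np.sqrt(V / k)[:, None]

# Component j = 1: (T_1 - mu_1)/sqrt(Sigma_11) ~ t_{1,k}(0, 1),
# so its sample variance should be near k/(k-2) = 1.25
std1 = (T[:, 1] - mu[1]) / np.sqrt(Sigma[1, 1])
print(std1.var())   # approximately 1.25
```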

4 Back to Bayesian Inference

Therefore, for the second model, $p=4$, and matching the posterior kernel with the $t$ density gives $p+k=n$, i.e. $k=n-4$. The posterior is a $t$-distribution:
$$\beta\mid\text{data}\sim t_{4,\,n-4}\left(\hat\beta,\ \frac{S(\hat\beta)}{n-4}(X^TX)^{-1}\right).$$

In previous notes we obtained the unbiased estimator $\hat\sigma^2=\frac{S(\hat\beta)}{n-4}$ of $\sigma^2$ (note the divisor is $n-p$ with $p=4$ here), so
$$\beta\mid\text{data}\sim t_{4,\,n-4}\left(\hat\beta,\ \hat\sigma^2(X^TX)^{-1}\right).$$


When the degrees of freedom $n-4$ of the $t$-distribution are large, i.e. when $n$ is large, we can approximate the posterior of the second model by $\beta\mid\text{data}\approx N_4\left(\hat\beta,\ \hat\sigma^2(X^TX)^{-1}\right)$.

For the general setting
$$y_t=\beta_0+\beta_1 x_{t1}+\beta_2 x_{t2}+\cdots+\beta_m x_{tm}+\varepsilon_t,\qquad \varepsilon_t\overset{\text{i.i.d.}}{\sim}N(0,\sigma^2),$$
we have $p=m+1$ (as $\beta=(\beta_0,\dots,\beta_m)^T$ has $m+1$ components), so $k=n-p=n-m-1$. It is also clear that $\mu=\hat\beta$, and
$$\frac{1}{k}\Sigma^{-1}=\frac{X^TX}{S(\hat\beta)}\quad\Longrightarrow\quad \Sigma=\frac{S(\hat\beta)}{k}(X^TX)^{-1}=\frac{S(\hat\beta)}{n-m-1}(X^TX)^{-1},$$
therefore
$$\beta\mid\text{data}\sim t_{m+1,\,n-m-1}\left(\hat\beta,\ \frac{S(\hat\beta)}{n-m-1}(X^TX)^{-1}\right).$$
As in the $p=4$ case, denote $\hat\sigma^2=\frac{S(\hat\beta)}{n-m-1}$, so
$$\beta\mid\text{data}\sim t_{m+1,\,n-m-1}\left(\hat\beta,\ \hat\sigma^2(X^TX)^{-1}\right). \tag{3.1}$$
By the Fact above, the marginals are
$$\beta_j\mid\text{data}\sim t_{1,\,n-m-1}\left(\hat\beta_j,\ \hat\sigma^2\left[(X^TX)^{-1}\right]_{j+1,\,j+1}\right). \tag{3.2}$$
When $n$ is large, (3.1) is approximately $N_{m+1}\left(\hat\beta,\ \hat\sigma^2(X^TX)^{-1}\right)$ and (3.2) is approximately $N\left(\hat\beta_j,\ \hat\sigma^2\left[(X^TX)^{-1}\right]_{j+1,\,j+1}\right)$.
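Putting the pieces together, the marginal posteriors (3.2) yield credible intervals for each coefficient. A sketch for the cubic model on simulated data, using `scipy.stats.t` (all data and parameter values below are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 60
t = np.arange(1, n + 1)

# Cubic model (2): p = m + 1 = 4 columns, degrees of freedom k = n - 4
X = np.column_stack([t**0, t**1, t**2, t**3])
beta_true = np.array([1.0, 0.5, -0.02, 0.0005])
y = X @ beta_true + rng.normal(0, 2.0, n)

k = n - X.shape[1]                               # n - m - 1
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
sigma2_hat = np.sum((y - X @ beta_hat)**2) / k   # S(beta_hat) / (n - m - 1)

# 95% credible interval for each beta_j from the marginal t posterior (3.2)
tcrit = stats.t.ppf(0.975, df=k)
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))      # sqrt(sigma_hat^2 [(X^T X)^{-1}]_{jj})
lower, upper = beta_hat - tcrit * se, beta_hat + tcrit * se
for j in range(len(beta_hat)):
    print(f"beta_{j}: {beta_hat[j]:+.4f}  95% CI [{lower[j]:+.4f}, {upper[j]:+.4f}]")
```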